How a football player’s market value is determined is a perennial question. This project takes a statistical look at the attributes that define football players and examines how they influence players’ market values. As data becomes central to the sport, understanding a player’s performance, skills, and career trajectory matters to gaming enthusiasts, sports analysts, and football fans alike.
The FIFA Football Players Dataset is a comprehensive collection of information about football (soccer) players from around the world.
This dataset gave us access to all the potential features that characterize a football player:
| player | country | height | weight | age |
| club | ball_control | dribbling | marking | slide_tackle |
| stand_tackle | aggression | reactions | att_position | interceptions |
| vision | composure | crossing | short_pass | long_pass |
| acceleration | stamina | strength | balance | sprint_speed |
| agility | jumping | heading | shot_power | finishing |
| long_shots | curve | fk_acc | penalties | volleys |
| gk_positioning | gk_diving | gk_handling | gk_kicking | gk_reflexes |
After exploring and visualizing the data, we concluded that modeling goalkeepers and field players together would not be optimal, because the two groups are characterized by different features. Visualization showed a clear gap in the GK feature values between 20 and 40, so we treated any player with a GK feature above 30 as a goalkeeper and used this threshold to filter goalkeepers out. We then excluded all goalkeeper (GK) features. As further preprocessing, we removed the player, country and club features, as we judged that they would not significantly affect the estimation. Finally, the marking variable was dropped because it only took the values ‘nan’ and 0 and therefore carried no information. These steps keep the modeling process focused on the features that matter for field players.
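The filtering rule described above can be sketched in a few lines of plain Python. This is only an illustration: the two player records are invented, and reading "a GK feature above 30" as "any one of the five GK ratings above 30" is our assumption.

```python
# Column names follow the dataset's feature list; the records below are made up.
GK_FEATURES = ["gk_positioning", "gk_diving", "gk_handling", "gk_kicking", "gk_reflexes"]
DROPPED = set(GK_FEATURES) | {"player", "country", "club", "marking"}
GK_THRESHOLD = 30  # sits inside the 20-40 gap seen in the plots


def is_goalkeeper(row, threshold=GK_THRESHOLD):
    """Flag a player as a goalkeeper if any GK rating exceeds the threshold (assumed rule)."""
    return any(row[f] > threshold for f in GK_FEATURES)


def preprocess(rows):
    """Remove goalkeepers first, then drop the columns excluded from modeling."""
    return [
        {k: v for k, v in row.items() if k not in DROPPED}
        for row in rows
        if not is_goalkeeper(row)
    ]


players = [
    {"player": "A", "country": "X", "club": "C1", "marking": 0, "dribbling": 80,
     "gk_positioning": 10, "gk_diving": 12, "gk_handling": 8, "gk_kicking": 15, "gk_reflexes": 11},
    {"player": "B", "country": "Y", "club": "C2", "marking": 0, "dribbling": 20,
     "gk_positioning": 75, "gk_diving": 78, "gk_handling": 74, "gk_kicking": 70, "gk_reflexes": 80},
]

field_players = preprocess(players)
print(field_players)  # only player A remains, with the excluded columns removed
```

Filtering on the GK ratings before dropping them matters here, since the threshold test needs those columns to still be present.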
Relationship
Plotting the target variable against the features revealed a nonlinear, roughly quadratic relationship with almost all of them. We therefore applied the logarithmic function to the value variable, aiming to make the relationships approximately linear for the subsequent analysis. With this log-level specification, a one-unit change in a predictor corresponds to an approximately constant percentage change in the predicted value (roughly 100 times the coefficient).
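To make the percentage interpretation concrete, the short sketch below compares the exact percentage change implied by a log-level coefficient with the usual "100 times the coefficient" approximation. The coefficient used is hypothetical and is not taken from our fitted models.

```python
import math


def pct_change_in_value(beta, delta=1.0):
    """Exact % change in value when a predictor rises by `delta` in a log-level model."""
    return (math.exp(beta * delta) - 1.0) * 100.0


beta = 0.015  # hypothetical coefficient, for illustration only

exact = pct_change_in_value(beta)  # exp(0.015) - 1, expressed in percent
approx = 100.0 * beta              # the common small-coefficient approximation

print(round(exact, 3), round(approx, 3))
```

The two numbers stay close as long as the coefficient is small; for larger coefficients the exact formula should be preferred.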
Correlation
Our first step in assessing feature importance was to identify which features were strongly correlated with one another, so that we could remove them and reduce redundancy in the model. At this point we dropped dribbling, short_pass, slide_tackle, stand_tackle, att_position, acceleration, long_shots and volleys, as each had a correlation above 0.8 with another feature. We then examined the correlation between each remaining feature and the target variable to identify features that were not useful for estimating a player’s market value. On this basis, height, weight, balance, jumping, age, interceptions, strength and sprint_speed were removed because they showed no meaningful relationship with value.
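A correlation filter of this kind might look like the sketch below. It is a simplified, greedy version written for illustration: the ratings are invented, and the rule for deciding which member of a correlated pair to keep (the one encountered first) is an assumption, not necessarily the choice made in the analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.8):
    """Keep a feature only if |r| with every already-kept feature is at most `threshold`."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Invented ratings for five players.
data = {
    "ball_control": [60, 70, 80, 90, 75],
    "dribbling":    [58, 72, 79, 91, 74],  # almost collinear with ball_control
    "stamina":      [80, 60, 75, 65, 70],
}

kept = drop_correlated(data)
print(kept)  # dribbling is dropped, the other two are kept
```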
We constructed two linear regression models. The first model (reg) included all remaining predictors. The second model (reg2) was then refined to include only those variables that showed significant predictive power for the player’s value, as assessed by the t-tests reported by the summary function.
Model Effectiveness:
Our initial model explained approximately 81.87% of the variance in players’ market values. We then refined the model by removing every feature with a p-value above 0.01, having chosen 1% as the significance level. The refined model retained an explanatory power of around 81.8%, which indicates a robust model with fewer variables. Variables such as reactions, ball_control, composure and agility were significantly associated with players’ market values, with p-values close to 0, reinforcing their importance in the model.
Heteroskedasticity Concerns:
Tests for heteroskedasticity, including the Breusch-Pagan and White tests, indicated non-constant variance in the residuals. Under heteroskedasticity the OLS coefficient estimators remain unbiased, but OLS is no longer BLUE: the estimators are not the most efficient ones, and the usual estimators of their variance are biased. This can be due to factors such as differences in player positions or the nonlinear effect of certain variables like age. We therefore applied robust estimation to obtain parameter estimates and standard errors that are more reliable in the presence of data anomalies or assumption violations. Even after adjusting the estimation for non-constant error variance, the variables in the model keep a meaningful relationship with the dependent variable.
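As a rough illustration of the ideas in this section, the sketch below fits a one-predictor regression on synthetic data whose noise grows with the predictor. It then computes a Breusch-Pagan style statistic (Koenker's n times R-squared variant) and compares the classical standard error of the slope with White's HC0 robust standard error. The data, the single-predictor setup and the choice of HC0 are all assumptions made for the example; they are not the exact procedure or estimator used in our analysis.

```python
import math


def ols_simple(x, y):
    """Fit y = a + b*x by least squares; return (intercept, slope, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - intercept - slope * a for a, b in zip(x, y)]
    return intercept, slope, resid


def slope_standard_errors(x, resid):
    """Classical and HC0 (White) standard errors of the slope."""
    n = len(x)
    mx = sum(x) / n
    dev2 = [(a - mx) ** 2 for a in x]
    sxx = sum(dev2)
    classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    hc0 = math.sqrt(sum(d * e * e for d, e in zip(dev2, resid))) / sxx
    return classical, hc0


def breusch_pagan_lm(x, resid):
    """LM = n * R^2 from regressing the squared residuals on x (Koenker's variant)."""
    sq = [e * e for e in resid]
    _, _, aux_resid = ols_simple(x, sq)
    mean_sq = sum(sq) / len(sq)
    sst = sum((s - mean_sq) ** 2 for s in sq)
    ssr = sum(u * u for u in aux_resid)
    return len(x) * (1.0 - ssr / sst)


# Synthetic data: the size of the noise grows with x, so the errors are heteroskedastic.
x = list(range(1, 11))
y = [2 * a + (0.2 * a if a % 2 else -0.2 * a) for a in x]

intercept, slope, resid = ols_simple(x, y)
classical_se, robust_se = slope_standard_errors(x, resid)
lm = breusch_pagan_lm(x, resid)

print(slope, classical_se, robust_se, lm)
```

With one auxiliary regressor, the LM statistic would be compared against a chi-squared distribution with one degree of freedom. In practice a statistics package would handle both the tests and the robust covariance for models with many predictors.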
Model Specification:
Building upon our preliminary model (reg2), we conducted a RESET test to check for model specification errors. The results suggested the need for additional or alternative predictors. Consequently, we constructed a final model (model_final) that included squared terms and interaction effects, providing a more nuanced understanding of the relationships. Because the target was log-transformed, the fitted equation is expressed in terms of the logarithm of value:
\[
\begin{aligned}
\log(\text{value}) =\ & 11.73 - 0.0326 \times \text{reactions} - 0.1055 \times \text{ball_control} \\
&- 0.0070 \times \text{vision} + 0.0148 \times \text{stamina} - 0.0158 \times \text{fk_acc} \\
&- 0.0298 \times \text{heading} + 0.0592 \times \text{penalties} + 0.0077 \times \text{agility} \\
&- 0.0011 \times \text{heading} \times \text{penalties} + 0.0002 \times \text{fk_acc} \times \text{crossing} \\
&+ 0.0007 \times \text{reactions}^2 + 0.0014 \times \text{ball_control}^2 \\
&+ 0.0010 \times \text{heading}^2
\end{aligned}
\]
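The fitted equation can be evaluated directly, as in the sketch below. The coefficients are the rounded ones reported above; the example player (every rating set to 70) is hypothetical, and we assume the left-hand side is the natural logarithm of value, in line with the log transformation described earlier. Because of the squared and interaction terms, the effect of a rating depends on its current level, which the second function illustrates for reactions.

```python
import math


def predict_log_value(p):
    """Predicted log market value from the rounded coefficients of model_final."""
    return (
        11.73
        - 0.0326 * p["reactions"] - 0.1055 * p["ball_control"]
        - 0.0070 * p["vision"] + 0.0148 * p["stamina"] - 0.0158 * p["fk_acc"]
        - 0.0298 * p["heading"] + 0.0592 * p["penalties"] + 0.0077 * p["agility"]
        - 0.0011 * p["heading"] * p["penalties"]
        + 0.0002 * p["fk_acc"] * p["crossing"]
        + 0.0007 * p["reactions"] ** 2
        + 0.0014 * p["ball_control"] ** 2
        + 0.0010 * p["heading"] ** 2
    )


def marginal_effect_reactions(p):
    """d log(value) / d reactions = -0.0326 + 2 * 0.0007 * reactions."""
    return -0.0326 + 2 * 0.0007 * p["reactions"]


# Hypothetical player with every rating equal to 70.
names = ["reactions", "ball_control", "vision", "stamina", "fk_acc",
         "heading", "penalties", "agility", "crossing"]
player = {name: 70 for name in names}

log_value = predict_log_value(player)        # about 14.88
value = math.exp(log_value)                  # back-transformed to the value scale
effect = marginal_effect_reactions(player)   # about 0.0654 per rating point

print(round(log_value, 2), round(value), round(effect, 4))
```

Note that the negative linear coefficient on reactions does not mean higher reactions lower the value: at a rating of 70 the squared term dominates and the marginal effect is positive. Exponentiating the prediction also ignores any retransformation correction, so the back-transformed value should be read as a rough figure.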
Predictive power
Our final model achieved an R-squared value of 0.868, an improvement of about 5 percentage points over the initial model. This indicates that 86.8% of the variability in players’ market values is explained by our selected predictors, reflecting excellent explanatory power. All features in the final model had p-values below 0.01, confirming their statistical significance in predicting players’ market values. The residual plot suggests that the log-level model comes close to meeting the assumption of normally distributed residuals.
Although the retained variables have strong predictive power and statistical significance, market values are also influenced by factors we could not quantify, such as fan popularity, marketability, and agent negotiation skills. The heteroskedasticity we detected also suggests that the model does not account for variance consistently across all levels of the independent variables. The model therefore provides a robust framework for understanding the quantifiable aspects of market value, but it cannot fully capture the multifaceted nature of a player’s worth, in which personal branding and agent influence play important roles.
For industry professionals and enthusiasts, the study is a reminder of how complex the valuation process is. Future research should explore other modeling approaches or interdisciplinary methods that incorporate these less tangible aspects and build a more comprehensive understanding of player valuation. Better models of this kind would support more informed decision-making in the sports industry.